
Record: 11-gram Eval Cache + Hedge Mixer (val_bpb: 0.8609)#909

Open
sunnypatneedi wants to merge 26 commits into openai:main from sunnypatneedi:submission/v10-moonshot-0.8609

Conversation


@sunnypatneedi sunnypatneedi commented Mar 26, 2026

11-gram Eval Cache + Hedge Mixer on PR #549 Base

val_bpb: 0.8609 (3-seed mean, std 0.0008, sliding window stride=64) | ~15.9 MB | 8×H100 SXM

Results (8×H100 80GB SXM, PyTorch 2.9.1+cu128)

| Seed | step_avg | Steps | Roundtrip bpb | Sliding+N-gram bpb | N-gram gain | Eval time | Artifact (bytes) |
|---|---|---|---|---|---|---|---|
| 42 | 92ms | ~6,500 | 1.1452 | 0.8600 | -0.2852 | ~188s | 15,341,541 |
| 1337 | 92ms | ~6,500 | 1.1452 | 0.8611 | -0.2841 | ~188s | 15,918,565 |
| 2025 | 92ms | 6,526 | 1.1452 | 0.8616 | -0.2836 | 188s | 15,790,804 |
| **Mean** | 92ms | ~6,500 | 1.1452 | **0.8609** (std 0.0008) | -0.284 | ~188s | |

Key Innovation: 11-gram Eval Cache with Entropy-Adaptive Mixing

The n-gram eval cache provides -0.284 bpb — the single largest improvement over the base model. It replaces TTT entirely, freeing the full eval time budget.

  1. Multi-order n-gram cache (orders 2-11): 10 hash tables with 4M buckets each, uint32 count tables
  2. Score-first, update-after protocol: n-gram counts are scored before being updated (legal per @valerio-oai, Issue #140)
  3. Entropy-adaptive alpha: mixing weight between neural and n-gram predictions is a function of model entropy — high-entropy (uncertain) tokens get more n-gram contribution
  4. Order-adaptive gating: higher-order matches get tighter entropy thresholds via `order_centers = 3.0 - 0.25 * (matched_order - min_order)`
  5. Hedge Mixer: online multiplicative-weights ensemble (beta=2.0) that learns optimal neural vs n-gram weighting across the eval run
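A minimal sketch of items 3-5 above, assuming the sigmoid gate, the `alpha`/entropy hyperparameters from the Run Config section, and the Hedge update form; function names and exact formulas are illustrative, not the submission's code:

```python
import math

def entropy_adaptive_alpha(entropy, matched_order, min_order=2,
                           alpha_max=0.40, ent_base=0.05, ent_range=0.55):
    """Mixing weight for the n-gram prediction (assumed form): grows with
    model entropy, with a tighter gate center for higher-order matches."""
    center = 3.0 - 0.25 * (matched_order - min_order)   # order-adaptive gate
    gate = 1.0 / (1.0 + math.exp(-(entropy - center)))  # sigmoid around center
    return alpha_max * (ent_base + ent_range * gate)

def hedge_update(weights, losses, beta=2.0):
    """One multiplicative-weights step: experts (neural vs n-gram-enhanced)
    with lower per-token loss gain weight over the eval run."""
    new = [w * math.exp(-beta * l) for w, l in zip(weights, losses)]
    total = sum(new)
    return [w / total for w in new]
```

With `beta=2.0` the mixer adapts within a few hundred tokens while staying stable when both experts perform similarly.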

N-gram Protocol

  1. Initialize 10 hash tables (orders 2-11), each with 4M buckets of uint32 counts
  2. For each evaluation position:
    • Score: look up n-gram match for each order (highest order first), compute n-gram probability
    • Compute model entropy from neural logits
    • Compute entropy-adaptive alpha (sigmoid of entropy vs order-specific threshold)
    • Hedge Mixer blends neural and n-gram-enhanced predictions using learned weights
    • Update: increment n-gram counts for all observed n-grams at this position
  3. Sliding window eval (stride=64) processes validation tokens with the n-gram cache active
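The score-first, update-after loop above can be sketched as follows; the hashing scheme, class shape, and count-ratio probability estimate are assumptions for illustration, not the submission's implementation:

```python
import numpy as np

class NgramCache:
    """Hashed multi-order n-gram count cache (sketch). One pair of uint32
    tables per order: (context, token) counts and context totals."""
    def __init__(self, min_order=2, max_order=11, buckets=1 << 22):
        self.min_order, self.max_order, self.buckets = min_order, max_order, buckets
        orders = range(min_order, max_order + 1)
        self.pair = {n: np.zeros(buckets, dtype=np.uint32) for n in orders}
        self.ctx = {n: np.zeros(buckets, dtype=np.uint32) for n in orders}

    def _h(self, items):
        return hash(tuple(items)) % self.buckets

    def score(self, history, token):
        """Score BEFORE updating: return (prob, matched_order) from the
        highest order whose context has been seen, else (None, None)."""
        for n in range(self.max_order, self.min_order - 1, -1):
            if len(history) < n - 1:
                continue
            ctx = history[-(n - 1):]
            total = self.ctx[n][self._h(ctx)]
            if total > 0:
                return self.pair[n][self._h(ctx + [token])] / total, n
        return None, None

    def update(self, history, token):
        """Update AFTER scoring: increment counts for all observed orders."""
        for n in range(self.min_order, self.max_order + 1):
            if len(history) < n - 1:
                continue
            ctx = history[-(n - 1):]
            self.ctx[n][self._h(ctx)] += 1
            self.pair[n][self._h(ctx + [token])] += 1
```

Scoring strictly before updating is what keeps the cache legal for evaluation: the count tables never contain the token currently being predicted.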

Run Config

```shell
cd /workspace/parameter-golf
SEED=42 BIGRAM_VOCAB_SIZE=0 VE_DIM=64 GRADQUANT_ENABLED=0 \
  torchrun --standalone --nproc_per_node=8 \
  records/track_10min_16mb/2026-03-26_sunnypatneedi_moonshot/train_gpt.py
```

All hyperparameters are baked into the script as defaults. Key environment variables:

```shell
# N-gram config
NGRAM_CACHE=1 NGRAM_ORDER=11 NGRAM_MIN_ORDER=2 NGRAM_BUCKETS=4194304
NGRAM_ENTROPY=1 NGRAM_ALPHA=0.40 NGRAM_ENT_BASE=0.05 NGRAM_ENT_RANGE=0.55

# Hedge Mixer
HEDGE_ENABLED=1 HEDGE_BETA=2.0

# Model (no BigramHash, VE_DIM=64 to fit 16MB across all seeds)
BIGRAM_VOCAB_SIZE=0 VE_DIM=64 GRADQUANT_ENABLED=0

# TTT disabled (n-gram replaces it)
TTT_ENABLED=0
```

Timing Budget

| Phase | Time |
|---|---|
| Training | 600s (≤10 min) |
| Int6 roundtrip eval (diagnostic) | ~49s |
| Sliding window + n-gram + Hedge eval (stride=64) | ~188s |
| **Total eval** | ~237s (<10 min) |

Training Architecture (from PR #549 SOTA)

| Component | Setting |
|---|---|
| Layers | 11 (512d, 8H, 4KV GQA) |
| MLP | 3× expansion, LeakyReLU(0.5)² |
| XSA | All 11 layers |
| Gated Attention | Enabled |
| RoPE | Partial (16/64 dims) |
| LN Scale | 1/√(layer+1) |
| VE64 | Layers 7-10 |
| Weight avg | EMA(0.997) + SWA(every 50) |
| Quantization | Uniform Int6 + zstd-22 |
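For reference, a minimal sketch of symmetric per-row uniform int6 quantization (clip at ±31), roughly matching the "Uniform Int6" row; the submission's exact scheme (scale search, zstd-22 packing) may differ:

```python
import numpy as np

def quantize_int6(w):
    """Per-row symmetric int6: map each row's max |weight| to 31."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scale[scale == 0] = 1.0  # avoid divide-by-zero on all-zero rows
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize_int6(q, scale):
    """Reconstruct float weights; roundtrip error is at most scale/2 per entry."""
    return q.astype(np.float32) * scale
```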

Ablation

| Config | val_bpb | Delta |
|---|---|---|
| Roundtrip (no n-gram, no sliding window) | 1.1452 | — (baseline) |
| + Sliding window (stride=64) + 11-gram + Hedge | 0.8609 | -0.284 |

Credits

sunnypatneedi and others added 24 commits March 24, 2026 10:48
Two-phase TTT pipeline (novel combination):
- Phase 1: In-Place TTT — updates MLP output projections per-document (ICLR 2026)
- Phase 2: Per-doc LoRA TTT — adapts Q/V/LM head with surprise gating (top-K tokens)

Architecture: PR openai#486 base (11L, TrigramHash, ValueResidual, GradQuant) +
LeakyReLU(0.5)^2 + eval-only XSA on all layers + Partial RoPE + LN Scale

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- gptq_calibrate(): collect Hessian H=X^TX via forward hooks on training data
- gptq_quantize_weight(): column-wise int6 with Cholesky error compensation
- _find_best_row_scales(): percentile search for optimal per-row scales
- Integrated into mixed_quantize_int6() — falls back to naive when no Hessian
- Expected: -0.0026 bpb from better quantization alone (PR openai#535 ablation)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Bug 1: Function adapted MLP weights but never scored documents.
  All compute was wasted — no loss/bpb accumulation.
  Fix: Rewrote as inplace_ttt_eval() with apply-then-update loop:
  score chunk first (accumulate bpb), then gradient-update MLP proj.

Bug 2: Model left in last document's adapted state after function.
  This corrupted subsequent LoRA TTT evaluation.
  Fix: Reset MLP weights to original after all documents.

Also: Made In-Place TTT and LoRA TTT alternatives (config switch)
rather than sequential phases, since both produce val_bpb scores.
Use INPLACE_TTT_ENABLED=1 for In-Place, =0 for LoRA TTT.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Run 1 results:
- Artifact 16.35MB (352KB over 16MB limit) — caused by GradQuant int7
- LoRA TTT took 1572s (2.6x over 600s budget) — 20 epochs too many
- Pre-quant val_bpb: 1.1757 (46 shards, not full 80)
- Post-quant sliding window: 1.3569

Fixes:
- GradQuant: top-10% sensitivity stays int6 (not int7)
- TTT epochs: 20 → 5 (should complete in ~400s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Run 1 showed:
- Pre-quant val_bpb: 1.1757
- Post-quant sliding window: 1.3569
- Quantization penalty: 0.18 bpb (expected ~0.003)

Root cause: Our GPTQ implementation (ported from PR openai#535) produced
WORSE quantization than standard per-row int6. PR openai#486 base doesn't
use GPTQ at all. Possible issues: bad Hessian calibration, numerical
instability in Cholesky decomposition, or name mismatch between
hooks and state dict keys.

Fix: Disable GPTQ, revert to standard quantization path.
GPTQ code preserved for future debugging.

Also confirmed: TTT bpb formula is algebraically correct.
The 0.6185 bpb was real (20 epochs = heavy per-doc overfitting).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Run 0: PR openai#548 UNMODIFIED (1.0865 proven). Reproduce baseline.
Run 1: PR openai#548 + LeakyReLU(0.5)^2 (1 line change). Measure delta.

Following retro lesson: baseline first, one change at a time.
No GPTQ, no In-Place TTT, no XSA, no surprise gating.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… PR openai#548

Run 0: PR openai#414 UNMODIFIED (merged SOTA 1.1228, verified 3-seed)
Run 1: PR openai#414 + LeakyReLU(0.5)^2 (1 line change)

Baseline against verified numbers, not claimed scores from open PRs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Builds on Run 1 (PR openai#414 + LeakyReLU). Adds:
- temperature param to eval_val_sliding (default 1.0, no change)
- After main eval, sweeps T={0.95,0.96,0.97,0.98,0.99}
- PR openai#576 reported T=0.98 gives -0.003 bpb for free

10 lines added over Run 1. Zero training cost.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Builds on Run 2. Changes from PR openai#414 base:
- MLP expansion: 3.0x → 3.5x (1536 → 1792 hidden, more params)
- Quantization: int6 → int5 (clip_range 31→15, fits more params)
- QAT: enabled with threshold 0.5 (early start, matching PR openai#576)
- QAT uses quantile(0.9995) clip instead of row max
- BigramHash: 2048 → 8192 buckets

From PR openai#576's "Train Larger, Quantize Harder" approach (1.1164 bpb).
8 lines changed from Run 2.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Template includes:
- README.md with placeholder results table
- submission.json with schema matching existing PRs
- submit.sh helper to collect logs and extract metrics

Fill in after successful runs, rename folder, PR to upstream.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
In-Place TTT: loss INCREASES (2.63+), 955s+ eval time. Harmful.
GradQuant int5/int6 mix: 34KB over 16MB even without int7.
PR openai#486 baseline reproduced at 1.1249 (within seed variance of 1.1233).

Added lessons 13-16 to CLAUDE.md.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PR openai#414 hardcodes `from flash_attn_interface import ...` (FA3/Hopper only).
This pod has FA2 but not FA3. Added try/except + SDPA fallback in attention.
Applied to all 4 runs (0-3).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pod has flash_attn 2.8.3 (from flash_attn import flash_attn_func)
but NOT flash_attn_interface (FA3/Hopper). Added cascading import.

Also keeping SDPA fallback for environments with no flash_attn at all.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Run 0: PR openai#549 UNMODIFIED (merged SOTA 1.1194, verified 3-seed)
Run 1: PR openai#549 + TTT_ENABLED=1 + TTT_LR=0.0005 (2 lines changed)

Both have FA3→FA2→SDPA fallback for non-Hopper GPUs.
Following retro: one change per run, baseline first.

Expected: Run 1 should achieve ~1.094-1.104 (beats 1.1144 target).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Upgrades TTT from PR openai#549's weak 3ep SGD (-0.0025 bpb) to PR openai#481's
proven AdamW 30ep cosine + per-layer LR recipe (expected -0.01 to -0.025).

Changes:
- train_gpt.py: Added _ttt_run_phase() + ttt_adapt() + TTT hyperparams
- run_3seeds.sh: Added TTT env vars for 3-seed validation
- finalize_submission.py: Extracts pre/post TTT metrics from logs
- README.md + submission.json: Updated for TTT-enabled submission

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Prevents "tensor does not have a device" error when torch.compile
tries to recompile after TTT modified model weights.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PR openai#549 SOTA base + PR openai#481 AdamW TTT recipe. Replaces weak 3ep SGD
TTT with 30ep cosine decay + per-layer LR (mlp.proj 3x, mlp.fc 0.5x).
3-seed mean: 1.0705 (std 0.0009). All artifacts under 16MB.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…_bytes

PR openai#771 was listed as "0 seeds" in the competition tracker because
submission.json was missing the required `seeds` and `track` fields,
and used `bytes_total` instead of the expected `artifact_bytes` field.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…hanced n-gram

- train_gpt_v10_safe.py: v9a + Hedge Mixer (multiplicative weights) + add-delta n-gram smoothing, dim=512
- train_gpt_v10_moonshot.py: model_dim=640 (42M params) + adaptive quant (ternary MLP / int4 attn / int6 embed) + Hedge Mixer
- auto_experiment.py: local CPU random search over 20 configs, logs to experiments.jsonl
- submit.sh: packaging and staging script for H100 runs
- PLAN.md: strategy doc with size estimates and run order

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- validate_configs.py: CPU-only artifact size estimator for moonshot configs (no GPU/data needed)
- experiments.jsonl: 20 initial random search results from auto_experiment.py

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v10 moonshot: ternary MLP quant + scaled model + hedge mixer + enhanced n-gram
3-seed mean 0.8609 bpb (42→0.8600, 1337→0.8611, 2025→0.8616).
All artifacts under 16MB. 11-gram n-gram cache with entropy-adaptive
alpha and Hedge Mixer on PR openai#549 base architecture.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
sunnypatneedi and others added 2 commits March 27, 2026 08:47
3-seed mean 0.8609 bpb (42→0.8600, 1337→0.8611, 2025→0.8616).
All artifacts under 16MB. 11-gram n-gram cache with entropy-adaptive
alpha and Hedge Mixer on PR openai#549 base architecture.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add comprehensive experiment tracking and moonshot submissions